Abstract

Most speech enhancement algorithms make use of the short-time Fourier transform
(STFT), which is a simple and flexible time-frequency decomposition that estimates the
short-time spectrum of a signal. However, the duration of short STFT frames is inherently
limited by the nonstationarity of speech signals. The main contribution of this paper
is a demonstration of speech enhancement and automatic speech recognition in the presence
of reverberation and noise by extending the length of analysis windows. We accomplish this
extension by performing enhancement in the short-time fan-chirp transform (STFChT) domain,
an overcomplete time-frequency representation that is coherent with speech signals
over longer analysis window durations than the STFT. This extended coherence is gained
by using a linear model of fundamental frequency variation of voiced speech signals. Our
approach centers around using a single-channel minimum mean-square error log-spectral
amplitude (MMSE-LSA) estimator proposed by Habets, which scales coefficients in a
time-frequency domain to suppress noise and reverberation. In the case of multiple microphones,
we preprocess the data with either a minimum variance distortionless response (MVDR)
beamformer or a delay-and-sum beamformer (DSB). We evaluate our algorithm on both
speech enhancement and recognition tasks for the REVERB challenge dataset. Compared to
the same processing done in the STFT domain, our approach achieves significant improvement
in terms of objective enhancement metrics (including PESQ, the ITU-T standard
measurement for speech quality). In terms of automatic speech recognition (ASR) performance
as measured by word error rate (WER), however, our experiments indicate that the STFT
with a long window is more effective.


1 Introduction 

Enhancement and recognition of speech signals in the presence of reverberation and noise remain
a challenging problem in many applications. Many past methods are prone to generating artifacts
in the enhanced speech, and must trade off noise reduction against speech distortion. Recent
approaches have started to address this issue, demonstrating improvements in both objective
speech quality and automatic speech recognition [1, 2].

In this paper, we propose using a new time-frequency domain that is more coherent with
speech signals over an extended period of time, which allows longer analysis windows. In turn,
longer analysis windows provide a more narrowband spectral representation, which concentrates
signal energy into a smaller number of FFT bins. Within these bins, the signal-to-noise ratio
(SNR) is increased, which results in less oversuppression of speech. We combine a statistically
optimal single-channel enhancement algorithm that suppresses background noise and reverberation
with an adaptive time-frequency transform domain that is coherent with speech signals
over longer durations than the short-time Fourier transform (STFT). Thus, we are able to use
longer analysis windows while still satisfying the assumptions of the optimal single-channel
enhancement filter. Multichannel processing is made possible using a classic minimum variance
distortionless response (MVDR) beamformer or, in the case of two-channel data, a delay-and-sum
beamformer (DSB) preceding the single-channel enhancement.

*swisdom@uw.edu

First, we review the speech enhancement and dereverberation problem, as well as the
enhancement algorithm we use, proposed by Habets [3], which suppresses both noise and late
reverberation based on a statistical model of reverberation (originally proposed by Lebart et al.
[4]). Then, we describe the fan-chirp transform, proposed by Weruaga and Kepesi [5, 6] and
improved upon by Cancela et al. [7], which provides an enhancement domain, the short-time
fan-chirp transform (STFChT), that better matches time-varying harmonic content of voiced
speech.

We discuss why performing the enhancement in the STFChT domain gives superior results
compared to the STFT domain. Further improvements over our original submission [8] to the
REVERB challenge [9] are described, and we explore better parameter settings. We
present both speech enhancement and recognition results on the REVERB challenge dataset
[9], which show that our new method achieves superior results versus conventional STFT-based
processing in terms of objective speech enhancement measures. Through our automatic speech
recognition (ASR) experiments, we discover that STFT-based processing with a longer window
results in the lowest word error rates. Thus, our algorithm is an example of an operation
that improves enhancement and objective quality metrics but that, for reasons we hypothesize,
does not improve ASR. However, our enhancement method may be able to provide
complementary features to conventional STFT-based processing.

Our basic multichannel (given multiple microphones) architecture of single-channel enhancement
preceded by beamforming is not unprecedented. Gannot and Cohen [10] used a similar
architecture for noise reduction that consists of a generalized sidelobe cancellation (GSC)
beamformer followed by a single-channel post-filter. Maas et al. [11] employed a similar single-channel
enhancement algorithm for reverberation suppression and observed promising speech recognition
performance in even highly reverberant environments.

There have been several dereverberation and enhancement approaches that estimate and
leverage the time-varying fundamental frequency $f_0$ of speech. Nakatani et al. [12] proposed a
dereverberation method using inverse filtering that exploits the harmonicity of speech to build
an adaptive comb filter. Kawahara et al. [13] used adaptive spectral analysis and estimates of
$f_0$ to perform manipulation of speech characteristics.

Droppo and Acero [14] observed how the fundamental frequency of speech can change within
an analysis window, and proposed a new framework that could better predict the energy of
voiced speech. Dunn and Quatieri [15] used the fan-chirp transform for sinusoidal analysis
and synthesis of speech, and Dunn et al. [16] also examined the effect of various interpolation
methods on reconstruction error. Pantazis et al. [17] proposed an analysis/synthesis domain
that uses estimates of instantaneous frequency to decompose speech into quasi-harmonic
AM-FM components. Degottex and Stylianou [18] proposed another analysis/synthesis scheme for
speech using an adaptive harmonic model that they claim is more flexible than the fan-chirp, as
it allows nonlinear frequency trajectories.

Wisdom et al. showed that the fan-chirp transform can be used to build optimal detectors
for nonstationary harmonics [19] and harmonically-modulated stationary processes with
time-varying modulation frequency [20]. A preliminary version of this algorithm appeared in our
REVERB challenge workshop paper [8]. To our knowledge, these recent papers are the first to
use the fan-chirp transform for statistical signal processing.


2 Background 

This section gives necessary background on single-channel suppression of noise and late
reverberation and on the fan-chirp transform, which extends window duration and hence coherence.

2.1 Optimal single-channel suppression of noise and late reverberation

In this section, we review the speech enhancement problem and a popular statistical speech
enhancement algorithm, the minimum mean-square error log-spectral amplitude (MMSE-LSA)
estimator, which was originally proposed by Ephraim and Malah [21, 22] and later improved
by Cohen [23]. We review the application of MMSE-LSA to both noise reduction and joint
dereverberation and noise reduction, the latter of which was proposed by Habets [3].

2.1.1 Noise reduction using MMSE-LSA 

A classic speech enhancement algorithm is the minimum mean-square error (MMSE) short-time
spectral amplitude estimator proposed by Ephraim and Malah [21]. They later refined the
estimator to minimize the MSE of the log-spectra [22]. We will refer to this algorithm as LSA
(log-spectral amplitude). Minimizing the MSE of the log-spectra was found to provide better
enhanced output because log-spectra are more perceptually meaningful. Cohen [23] suggested
improvements to Ephraim and Malah’s algorithm, which he referred to as “optimal modified
log-spectral amplitude” (OM-LSA).

Given samples of a noisy speech signal 

y[n] = s[n] + v[n], (1) 

where s[n] is the clean speech signal and v[n] is additive noise, the goal of an enhancement 
algorithm is to estimate s[n] from the noisy observations y[n]. Clean speech and noise are 
additive in the STFT domain: 


$$Y(d,k) = S(d,k) + V(d,k). \tag{2}$$

The LSA estimator yields an estimate $A(d,k)$ of the clean STFT magnitudes $|S(d,k)|$ (where
$S(d,k)$ are assumed to have a proper complex-valued Gaussian distribution) by applying a
frequency-dependent gain $G_{\mathrm{LSA}}(d,k)$ to the noisy STFT magnitudes $|Y(d,k)|$:

$$A(d,k) = G_{\mathrm{LSA}}(d,k)\,|Y(d,k)|. \tag{3}$$

Given these estimated magnitudes, the enhanced speech is reconstructed from STFT coefficients
that combine $A(d,k)$ with the noisy phase:

$$\hat{S}(d,k) = A(d,k)\,e^{j\angle Y(d,k)}. \tag{4}$$

The LSA gains are computed as [22, equation (20)]:

$$G_{\mathrm{LSA}}(d,k) = \frac{\xi(d,k)}{1+\xi(d,k)}\exp\left\{\frac{1}{2}\int_{\nu(d,k)}^{\infty}\frac{e^{-t}}{t}\,dt\right\}. \tag{5}$$

The lower integral bound in (5) is

$$\nu(d,k) = \frac{\xi(d,k)}{1+\xi(d,k)}\,\gamma(d,k), \tag{6}$$

where $\xi(d,k)$ and $\gamma(d,k)$ are the a priori and a posteriori signal-to-noise ratios (SNRs),
respectively, for the $k$th frequency bin of the $d$th frame. These SNRs are defined to be

$$\xi(d,k) \triangleq \frac{\lambda_s(d,k)}{\lambda_v(d,k)} \quad\text{and}\quad \gamma(d,k) \triangleq \frac{|Y(d,k)|^2}{\lambda_v(d,k)}, \tag{7}$$

where

$$\lambda_s(d,k) = E\left\{|S(d,k)|^2\right\} \tag{8}$$

and

$$\lambda_v(d,k) = E\left\{|V(d,k)|^2\right\} \tag{9}$$

are the variances of $S(d,k)$ and $V(d,k)$, respectively.
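As an illustration of how the LSA gain can be evaluated numerically, the following minimal sketch computes $G_{\mathrm{LSA}} = \frac{\xi}{1+\xi}\exp\{\frac{1}{2}E_1(\nu)\}$ with $\nu = \frac{\xi}{1+\xi}\gamma$, where $E_1$ is the exponential integral. The function names `exp_integral_e1` and `lsa_gain` are hypothetical, and the series and continued-fraction evaluation of $E_1$ is an assumption chosen to keep the example dependency-free; it is not taken from the implementation used in this paper.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp_integral_e1(x):
    """E1(x) = integral of exp(-t)/t from x to infinity, for x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x < 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, 40):
            term *= -x / k      # term now equals (-x)^k / k!
            total += term / k
        return -EULER_GAMMA - math.log(x) - total
    # Continued fraction e^{-x} / (x + 1/(1 + 1/(x + 2/(1 + 2/(x + ...))))),
    # evaluated backwards from a truncated tail.
    tail = 0.0
    for k in range(80, 0, -1):
        tail = k / (1.0 + k / (x + tail))
    return math.exp(-x) / (x + tail)

def lsa_gain(xi, gamma):
    """LSA gain for a priori SNR xi > 0 and a posteriori SNR gamma > 0 (linear scale)."""
    wiener = xi / (1.0 + xi)
    nu = wiener * gamma          # lower bound of the integral
    return wiener * math.exp(0.5 * exp_integral_e1(nu))
```

Because the exponential factor is never smaller than one, the gain in this sketch is always at least the Wiener-like factor $\xi/(1+\xi)$, and it approaches that factor when $\nu$ is large.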

Cohen [23] refined Ephraim and Malah’s approach to include a lower bound $G_{\min}$ for the gains
as well as an a priori speech presence probability (SPP) estimator $p(d,k)$. Cohen’s estimator is
as follows [23, equation (8)]:

$$G_{\mathrm{OM\text{-}LSA}}(d,k) = \left\{G_{\mathrm{LSA}}(d,k)\right\}^{p(d,k)}\,G_{\min}^{\,1-p(d,k)}. \tag{10}$$

Cohen also derived an efficient estimator for the SPP $p(d,k)$ [25] that exploits the strong
interframe and interfrequency correlation of speech in the STFT domain.
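The OM-LSA rule can be read as a geometric interpolation between the LSA gain and the gain floor, weighted by the speech presence probability. A minimal sketch follows, assuming the standard form $G = G_{\mathrm{LSA}}^{\,p}\,G_{\min}^{\,1-p}$; the function name `om_lsa_gain` is hypothetical.

```python
def om_lsa_gain(g_lsa, p, g_min):
    """Geometric interpolation between the LSA gain and the gain floor.

    g_lsa : LSA gain for this time-frequency bin (0 < g_lsa <= 1)
    p     : speech presence probability in [0, 1]
    g_min : lower bound on the gain (0 < g_min <= 1)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return (g_lsa ** p) * (g_min ** (1.0 - p))
```

When speech is certainly present ($p = 1$) the LSA gain is used unchanged, and when speech is certainly absent ($p = 0$) the bin is attenuated to the floor $G_{\min}$. In decibels, the interpolation is linear in $p$.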

2.1.2 Joint dereverberation and noise reduction 

This subsection reviews an MMSE-LSA enhancement algorithm proposed by Habets [3] that
uses a statistical model of reverberation to suppress both noise and late reverberation. Such a
statistical model-based approach to dereverberation was originally proposed by Lebart et al. [4].
We will refer to this type of MMSE-LSA as HMMSE-LSA (for Habets MMSE-LSA). The signal
model Habets uses is

$$y[n] = s[n] * h[n] + v[n] = x_e[n] + x_\ell[n] + v[n], \tag{11}$$

where $s[n]$ is the clean speech signal, $h[n]$ is the room impulse response (RIR), and $v[n]$ is additive
noise. The terms $x_e[n]$ and $x_\ell[n]$ correspond to the early and late reverberated speech signals,
respectively. The partition between early and late reverberations is determined by a parameter
$n_e$, which is a discrete sample index. All samples in the RIR before $n_e$ are taken to cause early
reflections, and all samples after $n_e$ are taken to cause late reflections [3]. Thus,

$$h[n] = \begin{cases} 0, & \text{if } n < 0 \\ h_e[n], & \text{if } 0 \le n < n_e \\ h_\ell[n], & \text{if } n_e \le n. \end{cases} \tag{12}$$

Using these definitions, $x_e[n] = s[n] * h_e[n]$ and $x_\ell[n] = s[n] * h_\ell[n]$.

Habets proposed a generalized statistical model of reverberation that is valid both when
the source-microphone distance is less than and when it is greater than the critical distance [3].
This model divides the RIR $h[n]$ into a direct-path component $h_d[n]$ and reverberant component
$h_r[n]$. Both direct-path and reverberant components are taken to be white, zero-mean, stationary
Gaussian noise sequences $b_d[n]$ and $b_r[n]$ with variances $\sigma_d^2$ and $\sigma_r^2$, respectively, scaled by
an exponential decay,

$$h_d[n] = b_d[n]\,e^{-\zeta n} \quad\text{and}\quad h_r[n] = b_r[n]\,e^{-\zeta n}, \tag{13}$$

where $\zeta$ is related to the reverberation time $T_{60}$ by [3]:

$$\zeta = \frac{3\ln(10)}{T_{60}\,f_s}. \tag{14}$$

Using this model, the expected value of the energy envelope of $h[n]$ is

$$E\left[h^2[n]\right] = \begin{cases} \sigma_d^2\,e^{-2\zeta n}, & \text{for } 0 \le n < n_d \\ \sigma_r^2\,e^{-2\zeta n}, & \text{for } n \ge n_d \\ 0, & \text{otherwise,} \end{cases} \tag{15}$$

where $n_d$ is a parameter chosen to be the number of samples that correspond to the direct part
of a reverberant signal.
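The decay model can be checked numerically. The sketch below assumes the decay rate $\zeta = 3\ln(10)/(T_{60} f_s)$ and an energy envelope proportional to $e^{-2\zeta n}$, so that the envelope falls by 60 dB after $T_{60}$ seconds. The function names and the variance arguments `sigma_d2` and `sigma_r2` are hypothetical.

```python
import math

def decay_rate(t60_seconds, fs_hz):
    """Per-sample decay rate zeta for a given reverberation time and sampling rate."""
    return 3.0 * math.log(10.0) / (t60_seconds * fs_hz)

def expected_energy_envelope(n, zeta, sigma_d2, sigma_r2, n_d):
    """Expected squared RIR value at sample n under the two-segment decay model."""
    if n < 0:
        return 0.0
    variance = sigma_d2 if n < n_d else sigma_r2   # direct part, then reverberant part
    return variance * math.exp(-2.0 * zeta * n)

# With T60 = 0.5 s at 16 kHz, the envelope at n = T60 * fs is 60 dB below its start.
zeta = decay_rate(0.5, 16000.0)
drop_db = 10.0 * math.log10(expected_energy_envelope(8000, zeta, 1.0, 1.0, 100))
```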

Figure 1: Reverberation model. A schematic illustration of the statistical reverberation
model given by equations (12)-(15).

Figure 1 shows a schematic illustration of this statistical model of reverberation. Under the
assumption that the speech signal is stationary over short analysis windows (i.e., duration much
less than $T_{60}$), Habets proposed [3, equation (3.87)] the following model of the spectral variance
of the reverberant component $x_r[n]$, which is denoted by $\lambda_{x_r}(d,k)$:

$$\lambda_{x_r}(d,k) = e^{-2\zeta R}\,\lambda_{x_r}(d-1,k) + \frac{E_r}{E_d}\left(1 - e^{-2\zeta R}\right)\lambda_{x_d}(d-1,k), \tag{16}$$


where $R$ is the number of samples separating two adjacent analysis frames and $E_r/E_d$ is the
inverse of the direct-to-reverberant ratio (DRR). The quantities $E_r$ and $E_d$ are the energies
of the reverberant and direct components of the signal, respectively. The DRR expresses the
energy level of the direct signal referenced to the energy level of the reverberant part. Thus,
the spectral variance of the reverberant component in the current frame $d$ is composed of scaled
copies of the spectral variance of the reverberation and the spectral variance of the direct-path
signal from the previous frame $d - 1$.

Using this model, the variance of the late reverberant component can be expressed as [3,
equation (3.85)]:

$$\lambda_{x_\ell}(d,k) = e^{-2\zeta(n_e - R)}\,\lambda_{x_r}\!\left(d - \frac{n_e}{R} + 1,\,k\right), \tag{17}$$

which is quite useful in practice, because the variance of the late-reverberant component can be 
computed from the variance of the total reverberant component. 
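To make the frame recursion concrete, here is a minimal sketch for a single frequency bin. It assumes the recursion $\lambda_{x_r}(d) = e^{-2\zeta R}\lambda_{x_r}(d-1) + \frac{E_r}{E_d}(1 - e^{-2\zeta R})\lambda_{x_d}(d-1)$ and the late-reverberant relation $\lambda_{x_\ell}(d) = e^{-2\zeta(n_e - R)}\lambda_{x_r}(d - n_e/R + 1)$, which is our reading of Habets' model. It further assumes that $n_e$ is a multiple of the hop $R$ and that the direct-path variances are given. All names are hypothetical.

```python
import math

def reverberant_variance_track(lambda_xd, zeta, hop, inv_drr, initial=0.0):
    """Recursively track the reverberant spectral variance across frames (one bin).

    lambda_xd : direct-path spectral variance per frame
    zeta      : per-sample decay rate
    hop       : number of samples R between adjacent frames
    inv_drr   : E_r / E_d, the inverse direct-to-reverberant ratio
    """
    decay = math.exp(-2.0 * zeta * hop)
    track = [initial]
    for d in range(1, len(lambda_xd)):
        track.append(decay * track[d - 1] + inv_drr * (1.0 - decay) * lambda_xd[d - 1])
    return track

def late_variance(track, d, zeta, n_e, hop):
    """Late-reverberant variance at frame d from the total reverberant variance."""
    lag = n_e // hop - 1            # frame index used is d - n_e/R + 1 = d - lag
    if d - lag < 0:
        return 0.0                  # not enough history yet
    return math.exp(-2.0 * zeta * (n_e - hop)) * track[d - lag]
```

For a constant direct-path variance, the recursion in this sketch settles at $\frac{E_r}{E_d}\lambda_{x_d}$, which matches the interpretation of $E_r/E_d$ as the inverse DRR.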

To suppress both noise and late reverberation, the a priori and a posteriori SNRs $\xi(d,k)$ and
$\gamma(d,k)$ from the previous section become a priori and a posteriori signal-to-interference ratios
(SIRs), given by [3, equations (3.25-26)]:

$$\xi(d,k) = \frac{\lambda_{x_e}(d,k)}{\lambda_{x_\ell}(d,k) + \lambda_v(d,k)} \tag{18}$$

$$\gamma(d,k) = \frac{|Y(d,k)|^2}{\lambda_{x_\ell}(d,k) + \lambda_v(d,k)}. \tag{19}$$

The gains are computed by plugging the SIRs in (18) and (19) into (5) and (6). Habets suggested
an additional change to (10), which makes $G_{\min}$ time- and frequency-dependent. This is done
because the interference of both noise and late reverberation is time-varying. The modification
is [3, equation (3.29)]

$$G_{\min}(d,k) = \frac{G_{\min,x_\ell}\,\lambda_{x_\ell}(d,k) + G_{\min,v}\,\lambda_v(d,k)}{\lambda_{x_\ell}(d,k) + \lambda_v(d,k)}. \tag{20}$$


Notice that two parameters in (14) and (16) are not known a priori; namely, $T_{60}$ and the
DRR. These parameters must be blindly estimated from the data. For $T_{60}$ estimation, Lollmann
et al. [24] propose a maximum-likelihood algorithm, which we found to be effective. As for
the DRR, Habets suggests an online adaptive procedure [3, §3.7.2]. This adaptive procedure
constrains the inverse DRR $E_r/E_d$ between 0 and 1, which assumes that the source is within the
critical distance (i.e., the distance at which direct and reverberant energy are equal). This
assumption prevents overestimation of the reverberant variance when the direct signal is active.
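The per-bin quantities that drive the joint suppressor can be sketched as follows. The example assumes that the interference variance is the sum of the late-reverberant and noise variances, that the SIRs are the early-speech variance and the noisy periodogram divided by that sum, and that the gain floor is a variance-weighted average of two separate floors. Function and argument names are hypothetical.

```python
def signal_to_interference_ratios(y_mag2, lam_early, lam_late, lam_noise):
    """Return (a priori SIR, a posteriori SIR) for one time-frequency bin.

    y_mag2    : |Y(d,k)|^2, squared magnitude of the noisy coefficient
    lam_early : spectral variance of the early speech component
    lam_late  : spectral variance of the late reverberant component
    lam_noise : spectral variance of the additive noise
    """
    interference = lam_late + lam_noise
    return lam_early / interference, y_mag2 / interference

def adaptive_gain_floor(lam_late, lam_noise, g_min_late, g_min_noise):
    """Time- and frequency-dependent gain floor: a variance-weighted average."""
    return (g_min_late * lam_late + g_min_noise * lam_noise) / (lam_late + lam_noise)
```

In this form the floor moves toward the reverberation floor when late reverberation dominates the interference and toward the noise floor when noise dominates.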


2.2 Analysis using the forward fan-chirp transform 

In this section, we review the forward short-time fan-chirp transform (STFChT), which is used
as the time-frequency analysis-synthesis domain for our enhancement algorithm. In section 2.3,
we describe our novel method of inverting the STFChT.

We adopt the fan-chirp transform formulation used by Cancela et al. [7]. The forward
fan-chirp transform is defined as

$$X(f,\alpha) = \int_{-\infty}^{\infty} x(t)\,\phi'_\alpha(t)\,e^{-j2\pi f\phi_\alpha(t)}\,dt, \tag{21}$$

where $\phi_\alpha(t) = \left(1 + \frac{1}{2}\alpha t\right)t$ and $\phi'_\alpha(t) = 1 + \alpha t$. The variable $\alpha$ is an analysis chirp rate,
normalized with respect to the total bandwidth $B$ (in Hertz) swept over a time duration of $T$
seconds. Using the change of variable $\tau = \phi_\alpha(t)$, (21) can be written as the Fourier transform
of a time-warped signal:

$$X(f,\alpha) = \int_{-\infty}^{\infty} x\left(\phi_\alpha^{-1}(\tau)\right)e^{-j2\pi f\tau}\,d\tau. \tag{22}$$

The short-time fan-chirp transform (STFChT) of $x(t)$ is defined as the fan-chirp transform
of the $d$th short frame of $x(t)$:

$$X_d(f,\hat{\alpha}_d) = \int_{-T_w/2}^{T_w/2} w(\tau)\,x_d\left(\phi_{\hat{\alpha}_d}^{-1}(\tau)\right)e^{-j2\pi f\tau}\,d\tau, \tag{23}$$

where $w(t)$ is an analysis window, $\hat{\alpha}_d$ (given by (27)) is the analysis chirp rate for the $d$th frame,
and $x_d(t)$ is the $d$th short frame of the input signal of duration $T$:

$$x_d(t) = \begin{cases} x(t - dT_{\mathrm{hop}}), & -T/2 \le t \le T/2 \\ 0, & \text{otherwise.} \end{cases} \tag{24}$$

Here, $T$ is the pre-warped short-time duration, $T_{\mathrm{hop}}$ is the frame hop, $T_w$ is the
post-warped short-time duration, and $w(t)$ is a $T_w$-long analysis window. The analysis window is
applied after time-warping so as to avoid warping of the window, which can cause unpredictable
smearing of the Fourier transform.

Implementing the fan-chirp transform as a time-warping followed by a Fourier transform
allows efficient implementation, consisting simply of an interpolation of the signal followed by
an FFT. In the implementation provided by Cancela et al. [7], the interpolation used in the
forward fan-chirp transform is linear.
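The warp-then-transform structure can be sketched in a few lines. The example below is an illustrative simplification, not the implementation of Cancela et al.: it assumes the warping function $\phi_\alpha(t) = (1 + \frac{1}{2}\alpha t)t$, resamples one frame on a uniform grid in the warped variable by linear interpolation, applies a Hamming window after warping, and uses a direct DFT in place of an FFT to avoid dependencies. All function names and the test signal parameters are hypothetical.

```python
import cmath
import math

def inverse_warp(tau, alpha):
    """Return t such that (1 + 0.5*alpha*t)*t == tau (branch with t -> tau as alpha -> 0)."""
    if abs(alpha) < 1e-12:
        return tau
    return (-1.0 + math.sqrt(1.0 + 2.0 * alpha * tau)) / alpha

def sample_linear(x, fs, t_start, t):
    """Linearly interpolate samples x[n], taken at times t_start + n/fs, at time t."""
    pos = (t - t_start) * fs
    n = min(max(int(math.floor(pos)), 0), len(x) - 2)
    frac = pos - n
    return (1.0 - frac) * x[n] + frac * x[n + 1]

def fan_chirp_frame(x, fs, t_start, alpha, t_warp, n_warp):
    """Magnitude spectrum (bins 0..n_warp//2) of one time-warped, windowed frame."""
    warped = []
    for m in range(n_warp):
        tau = (m / n_warp - 0.5) * t_warp                     # uniform grid in warped time
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * m / n_warp)  # applied after warping
        warped.append(window * sample_linear(x, fs, t_start, inverse_warp(tau, alpha)))
    return [abs(sum(w * cmath.exp(-2j * math.pi * k * m / n_warp)
                    for m, w in enumerate(warped)))
            for k in range(n_warp // 2 + 1)]

def peak_concentration(spectrum, half_width=2):
    """Fraction of spectral energy within half_width bins of the largest bin."""
    energy = [s * s for s in spectrum]
    peak = max(range(len(energy)), key=energy.__getitem__)
    lo, hi = max(0, peak - half_width), min(len(energy), peak + half_width + 1)
    return sum(energy[lo:hi]) / sum(energy)

# A linear chirp whose instantaneous frequency is f0 * (1 + alpha * t).
fs, alpha, f0, t_start = 16000.0, 5.0, 1000.0, -0.04
x = [math.cos(2.0 * math.pi * f0 * (1.0 + 0.5 * alpha * t) * t)
     for t in (t_start + n / fs for n in range(1281))]
matched = peak_concentration(fan_chirp_frame(x, fs, t_start, alpha, 0.06, 256))
unwarped = peak_concentration(fan_chirp_frame(x, fs, t_start, 0.0, 0.06, 256))
```

With the matched chirp rate, the warped frame is a pure sinusoid and nearly all of its energy falls around one bin. With `alpha = 0.0` the same frame is analyzed as in a conventional windowed DFT, and its energy is spread over many bins.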

Kepesi and Weruaga [5] provide a method for determining the analysis chirp rate $\alpha$
using the gathered log spectrum (GLogS). The GLogS is defined as the harmonically-gathered
log-magnitude spectrum:

$$\rho(f_0,\alpha) = \frac{1}{N_h}\sum_{k=1}^{N_h}\ln\left|X(kf_0,\alpha)\right|, \tag{25}$$

where $N_h$ is the maximum number of harmonics that fit within the analysis bandwidth. That
is,

$$N_h = \left\lfloor \frac{f_s}{2f_0\left(1 + \frac{1}{2}|\alpha|T_w\right)} \right\rfloor. \tag{26}$$

Cancela et al. [7] proposed several enhancements to the GLogS. First, they observed improved
results by replacing $\ln|\cdot|$ with $\ln(1 + \gamma|\cdot|)$. Cancela et al. note that this expression approximates
a $p$-norm, with $0 < p < 1$, where lower values of $\gamma$ with $\gamma \ge 1$ approach the 1-norm, while higher
values approach the 0-norm. They also note that $\gamma = 10$ gave good results for their
application.

Additionally, Cancela et al. propose modifications that suppress multiples and submultiples
of the current $f_0$. Also, they propose normalizing the GLogS such that it has zero mean and
unit variance. This is necessary because the variance of the GLogS increases with increasing
fundamental frequency. For means and variances measured over all frames in a database, a
polynomial fit is determined and the GLogS are compensated using these polynomial fits.

Let $\rho_d(f_0,\alpha)$ be the GLogS of the $d$th frame with these enhancements applied. For practical
implementation, finite sets $\mathcal{A}$ of candidate chirp rates and $\mathcal{F}_0$ of candidate fundamental
frequencies are used, and the GLogS is exhaustively computed for every chirp rate in $\mathcal{A}$ and
fundamental frequency in $\mathcal{F}_0$. The analysis chirp rate $\hat{\alpha}_d$ for the $d$th frame is thus found by

$$\hat{\alpha}_d = \arg\max_{\alpha\in\mathcal{A}}\ \max_{f_0\in\mathcal{F}_0}\ \rho_d(f_0,\alpha). \tag{27}$$
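The exhaustive search over candidate chirp rates and fundamental frequencies can be sketched as below. The scoring function `rho` is passed in as a callable, so the example is independent of how the GLogS is computed. The helper `max_harmonics` assumes that the harmonic count is the floor of $f_s / (2 f_0 (1 + \frac{1}{2}|\alpha| T_w))$. Both function names are hypothetical.

```python
import math

def max_harmonics(fs, f0, alpha, t_warp):
    """Largest number of harmonics of f0 that fit in the analysis bandwidth."""
    return int(math.floor(fs / (2.0 * f0 * (1.0 + 0.5 * abs(alpha) * t_warp))))

def select_chirp_rate(rho, alphas, f0s):
    """Grid search: return (alpha, f0) maximizing rho(f0, alpha) over both candidate sets."""
    best_score, best_alpha, best_f0 = None, None, None
    for alpha in alphas:
        for f0 in f0s:
            score = rho(f0, alpha)
            if best_score is None or score > best_score:
                best_score, best_alpha, best_f0 = score, alpha, f0
    return best_alpha, best_f0
```

Since the maximum over chirp rates of the maximum over fundamental frequencies equals the joint maximum, a single double loop suffices; the cost grows with the product of the two candidate set sizes.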

2.3 Synthesis using the inverse fan-chirp transform 

Inverting the fan-chirp transform is a matter of reversing the steps used in the forward
transform. Thus, the inverse fan-chirp transform for a short-time frame consists of an inverse Fourier
transform, removal of the analysis window, and an inverse time-warping. The removal of the
analysis window $w(t)$ from the $T_w$-long warped signal limits the choice of analysis windows to
non-zero functions only, such as a Hamming window, so the window can be divided out. Also,
since the warping is nonuniform, it is possible that the sampling interval between points may
exceed the Nyquist sampling interval. To combat the potential for aliasing, the data should be
oversampled before time-warping, which requires downsampling after undoing the time-warping.

The choice of post-warped duration $T_w$ and the method of interpolation used in the inverse
time-warping affect the reconstruction error of the inverse fan-chirp transform. There is a
trade-off between reconstruction performance and computational complexity, because interpolation
error decreases as interpolation order increases. Kepesi and Weruaga [25] analyzed fan-chirp
reconstruction error with respect to order of the time-warping interpolation and oversampling
factor, and found that for cubic splines and an oversampling factor of 2, a signal-to-error ratio
of over 30 dB can be achieved. For our application, we choose an oversampling factor of 8 and
cubic-spline interpolation.


3 Proposed Method 

As discussed in the introduction, our main contribution is that we use the short-time fan-chirp 
transform as the analysis-synthesis domain for the HMMSE-LSA algorithm. In this section, 
we describe two aspects of our proposed method. First, we discuss the benefits of performing 
enhancement in the short-time fan-chirp domain. Next, we describe our method of iterative 
enhancement, which provides additional improvement to the processing. We go on to show how 
the parameters of iterative enhancement and analysis window duration affect our processing. 


3.1 Advantage of HMMSE-LSA in the Fan-Chirp Domain 


Unlike a conventional Fourier transform, the fan-chirp transform captures intra-window
frequency variation. As a result, the fan-chirp transform better matches the frequency content of a
harmonic signal and concentrates the signal’s energy into fewer bins. To illustrate this property,
we perform a comparison of the local time-frequency SNRs of the STFChT and the STFT. Both
transforms are applied to a simulated signal of two linear harmonic chirps in a simulated noisy
and reverberant environment. The first chirp has a fundamental frequency varying from 200 Hz
to 233 Hz, and the second chirp decreases from 250 Hz to 200 Hz. Both chirps last for 200 ms
and have 20 harmonics. To simulate reverberation, we convolve the signal with a measured room
impulse response (RIR) corresponding to the medium size room 2 far condition from the
WSJCAM0 speech corpus from the 2014 REVERB challenge dataset [9]. Recorded air conditioning
noise from the same room is added at 20 dB SNR.

Since we know the analytical form of the test signal, we know precisely which time-frequency 
bins contain direct signal. Convolving this known test signal with a measured RIR and adding 
actual recorded noise allows us to view the true local SNR in each time-frequency bin of the 
two transforms for realistic reverberation and additive noise. Given a time-frequency transform 
$S(d,k)$ of the direct signal, the time-frequency transform $X_r(d,k)$ of the reverberant signal, and
the time-frequency transform $V(d,k)$ of the noise, we compute local SNR in a time-frequency
bin $(d,k)$ as

$$\mathrm{SNR}_{\mathrm{local}}(d,k) = \frac{|S(d,k)|^2}{|X_r(d,k)|^2 + |V(d,k)|^2}. \tag{28}$$

Figure 2: Oracle local SNR values for a sequence of synthetic chirp signals. These oracle
local SNRs illustrate the less smeared concentration of SNR within individual time-frequency
bins for direct path signal in the STFChT (right) as compared to the STFT (left).

Thus, we can observe this oracle local SNR for bins containing direct signal, noise, and
reverberation, and for bins containing only noise and reverberation.
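Given oracle access to the three component transforms, the local SNR of a bin is a one-line computation. The sketch below assumes the ratio of direct-signal power to the sum of reverberant and noise power, expressed in decibels; the function name `local_snr_db` and the `floor` guard are hypothetical.

```python
import math

def local_snr_db(s, x_r, v, floor=1e-12):
    """Oracle local SNR in dB for one time-frequency bin.

    s, x_r, v : complex coefficients of the direct signal, the reverberant
                component, and the noise in the same bin
    floor     : small constant guarding against division by zero and log(0)
    """
    signal_power = abs(s) ** 2
    interference_power = abs(x_r) ** 2 + abs(v) ** 2
    return 10.0 * math.log10(max(signal_power, floor) / max(interference_power, floor))
```

For example, a bin with unit direct-signal magnitude and a combined interference magnitude of 0.1 has a local SNR of 20 dB.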

Figure 2 shows these oracle SNR values for the STFT and the STFChT representations of
the chirps. Figures 3 and 4 show empirical probability density functions (PDFs) for the SNR
values under two cases: time-frequency bins containing direct signal, noise and reverberation,
and bins containing only noise and reverberation. We designate direct bins as the ones in which
the direct signal should ideally fall given our knowledge of the synthetic test signals, and the
noisy/reverberant bins make up the remainder.


Figure 3: Empirical distribution of local SNR values in time-frequency bins
containing direct signal (panel title: Empirical SNR PDFs for Direct TF Points). Given for the
synthetic chirp signals in figure 2. Notice that the STFChT provides a higher mean local SNR
within time-frequency bins containing direct signal.

Figure 4: Empirical distribution of local SNR values in time-frequency points
containing only noise and reverberation (panel title: Empirical SNR PDFs for
Noisy/Reverberant TF Points). Given for the synthetic chirp signals in figure 2. Notice that the
STFT and STFChT have similar distributions for SNR in these signal-free time-frequency bins.


As can be seen in the two plots in Figure 2, the STFChT (right) appears to better lock on
to the harmonics despite the noise and reverberation, whereas the STFT (left) smears out the
energy in time and frequency. Figure 3 shows that the expected SNR in STFChT bins is higher
than the expected SNR in STFT bins, while figure 4 shows that the distribution of the SNR
in noisy and reverberant bins is unchanged from STFT to STFChT. The STFChT effectively
partitions more direct signal power from noise and reverberation. Since HMMSE-LSA applies gains
to individual time-frequency bins, the more the STFChT can partition direct signal power from
noise and reverberation, the better performance will be. Thus, when a noise and reverberation
dominated time-frequency bin is suppressed, less speech power is lost, and fewer speech artifacts
are created.

Moreover, concentrating direct-path signal power prevents HMMSE-LSA from over-suppressing
the speech signals, which is a common problem when enhancing in the STFT domain. Cappe
analyzed [55] how the original Ephraim and Malah LSA estimator tends to greatly reduce
musical noise artifacts. Musical noise artifacts are an unnatural disturbance in enhanced speech,
and are caused by enhanced noise-only bands having spectral peaks that sound
like random narrowband tones [26]. MMSE-LSA tends to have fewer artifacts than Wiener filtering
or spectral subtraction.

Cappe observed that a high a posteriori SIR $\gamma(d,k)$ causes more attenuation compared to
a standard Wiener gain, especially when the a priori SIR $\xi(d,k)$ is small; $\gamma(d,k)$ provides a
“correction factor” when $\xi(d,k)$ has been incorrectly estimated.

Considering this observation, Cappe described two cases:

1. $\gamma(d,k) < 0$ dB, i.e., noise-dominated time-frequency bins: in this case, $\xi(d,k)$ is a highly
smoothed version of $\gamma(d,k)$. This smoothing eliminates spectral peaks in noise-only regions.

2. $\gamma(d,k) > 0$ dB, i.e., speech-dominated time-frequency bins: in this case, $\xi(d,k)$ tends to
follow $\gamma(d,k)$ with a one-frame delay.

We have seen that the STFChT of a harmonic signal concentrates more direct signal energy
into only a few bins as compared to the STFT. Thus, according to point 2 above, when only a
few bins correspond to speech, in these bins the a priori SIR $\xi(d,k)$ will closely follow $\gamma(d,k)$.
Furthermore, since the SNR distribution in noise and reverberation-dominated bins is similar
between the STFT and STFChT, the advantageous smoothing mentioned in point 1 will reduce
spectral peaks and hence tonal artifacts.


Figure 5: Spectrogram comparisons of STFT-based HMMSE-LSA to STFChT-based
HMMSE-LSA. Upper left: noisy, reverberated audio. Upper right: ideal clean signal (direct
plus early reflections), which is the ground truth to be recovered. Lower left: spectrogram of
enhancement using STFT-based HMMSE-LSA. Lower right: spectrogram of enhancement using
STFChT-based HMMSE-LSA. The comparison between lower left and lower right shows that
STFChT exhibits less over-suppression of speech energy.

An example of the STFChT providing less over-suppression is shown in figure 5. The figure
shows a clip of a noisy, reverberated speech signal (upper left panel) using the same RIR and
noise as used for the synthetic chirps in figures 2 through 4. The upper right panel shows the
direct signal plus early reflections that are desired to be recovered. STFT-based HMMSE-LSA
processing exhibits over-suppression of direct speech energy (lower left), while the STFChT
better preserves the direct speech signal (lower right).


3.2 Iterative enhancement and parameter tuning 

Our enhancement method can be improved by subsequent iterations. Iterative enhancement
proceeds by successively running our above algorithm multiple times on a noisy utterance and
taking a convex combination of these outputs. In general, the output of iterative
enhancement is

$$\hat{s}_{\mathrm{iter}}[n] = \sum_{i=1}^{I} a_i\,\hat{s}_i[n], \tag{29}$$

where $\hat{s}_i[n]$ is the noisy single-channel audio $y[n]$ processed $i$ times by an enhancement
algorithm, $I$ is the maximum number of iterations, and $\{a_i\}_{i=1}^{I}$ are convex mixing weights (that is,
the $a_i$ are nonnegative and $\sum_{i=1}^{I} a_i = 1$). In particular, we found that performance was best
improved using a convex combination of once- and twice-iterated processing; thus, we set $I = 2$.
The second iteration of processing uses reverberation parameters estimated during the first
iteration of processing (e.g., $T_{60}$ time). Iterative processing is done on single-channel data, and can
serve as a post-filter for a beamformer.
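The convex mixing of repeatedly processed audio can be sketched generically. In the example below, `enhance` stands for any single-channel enhancement routine that maps a list of samples to a list of the same length; it is a placeholder, not the HMMSE-LSA implementation, and the sketch does not model the reuse of reverberation parameters between iterations. The function name `iterative_enhance` is hypothetical.

```python
def iterative_enhance(y, enhance, weights):
    """Convex combination of the input processed 1, 2, ..., len(weights) times.

    y       : noisy single-channel samples
    enhance : callable mapping a sample list to an enhanced sample list
    weights : nonnegative mixing weights that sum to one
    """
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to one")
    output = [0.0] * len(y)
    current = list(y)
    for w in weights:
        current = enhance(current)                      # one more pass of processing
        output = [o + w * c for o, c in zip(output, current)]
    return output
```

With two weights `[a, 1 - a]`, the output mixes once-processed and twice-processed audio, and `1 - a` plays the role of the degree of iteration.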

We performed experiments to tune the parameters of iterative enhancement. Our goal was
not only to discover the optimal iterative mixing parameter $a$, but also to choose the best analysis
window duration $T_{\mathrm{win}}$. For $I = 2$, the convex weights are parameterized by $a$, with $a_1 = a$ and
$a_2 = 1 - a$, and $0 \le a \le 1$. The degree of iterative enhancement is given by $(1 - a)$, since a larger
$(1 - a)$ indicates more of the twice-processed audio in the output. To tune these parameters, we
chose 30 random utterances from each of the 6 SimData conditions, which are all permutations
of the three rooms (room1, room2, and room3) and two distances (near and far). We tried both
STFT- and STFChT-based processing on these utterances.

Figure 6 shows the PESQ and SRMR scores versus $T_{\mathrm{win}}$ and $(1 - a)$. The results reveal an
interesting trade-off between speech quality (measured by PESQ) and dereverberation (measured
by SRMR): a higher degree of iteration results in more dereverberation, at the cost of speech
quality. These results also demonstrate the ability of the STFChT to increase analysis window
duration. For STFT processing, a window duration of 64 ms is optimal, while for STFChT
processing, a window duration of 96 or 128 ms is optimal.

Figure 6: Performance of STFT-based and STFChT-based HMMSE-LSA versus
degree of iteration and window length on development SimData. Plots illustrating
the trade-off between speech quality (measured by PESQ) and dereverberation (measured by
SRMR). In general, the STFChT-based method achieves superior speech quality and
dereverberation.

To discover the optimal trade-off between speech quality and dereverberation, we perform
a minimum variance combination (MVC) of the PESQ and SRMR scores. This combination is
given by

$$C = (1 - c)\cdot\mathrm{SRMR} + c\cdot\mathrm{PESQ}, \tag{30}$$

where

$$c = \arg\min_{c}\ \operatorname{Var}_i\left[(1 - c)\cdot\mathrm{SRMR}_i + c\cdot\mathrm{PESQ}_i\right], \tag{31}$$

and $i$ runs over the indices of all combinations of $T_{\mathrm{win}}$ and $(1 - a)$ that are being tested. This
produces the minimum variance combination of PESQ and SRMR, which takes into account the
correlation between the two measures and their variances.
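If the objective is taken to be the variance of the combined score across the tested settings (an assumption about the exact criterion), the minimizing weight has a closed form. Setting the derivative of $(1-c)^2\sigma_S^2 + c^2\sigma_P^2 + 2c(1-c)\sigma_{SP}$ to zero gives $c = (\sigma_S^2 - \sigma_{SP}) / (\sigma_S^2 + \sigma_P^2 - 2\sigma_{SP})$, where $\sigma_S^2$ and $\sigma_P^2$ are the variances of the SRMR and PESQ scores and $\sigma_{SP}$ is their covariance. The sketch below implements this; the function names are hypothetical and the weight is not clipped to $[0, 1]$.

```python
def _mean(values):
    return sum(values) / len(values)

def _covariance(a, b):
    mean_a, mean_b = _mean(a), _mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / len(a)

def combination_variance(c, srmr, pesq):
    """Variance across settings of the combined score (1 - c) * SRMR + c * PESQ."""
    combined = [(1.0 - c) * s + c * p for s, p in zip(srmr, pesq)]
    return _covariance(combined, combined)

def minimum_variance_weight(srmr, pesq):
    """Closed-form weight c minimizing the variance of the combined score."""
    var_s = _covariance(srmr, srmr)
    var_p = _covariance(pesq, pesq)
    cov_sp = _covariance(srmr, pesq)
    denominator = var_s + var_p - 2.0 * cov_sp
    if denominator <= 0.0:
        raise ValueError("scores are perfectly correlated with equal variance")
    return (var_s - cov_sp) / denominator
```

The closed form makes the role of the correlation explicit: when the two scores move in opposite directions across settings, the covariance is negative and the weight is pulled toward an even mix.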

For STFT-based HMMSE-LSA (top panels), shorter windows ($T_{\mathrm{win}}$ = 48 ms or 64 ms) tend
to give the best PESQ/SRMR values, while for STFChT-based HMMSE-LSA (bottom panels),
longer windows ($T_{\mathrm{win}}$ = 96 ms or 128 ms) tend to give better results. In general, a higher
degree of iteration ($(1 - a) = 1$) provides better suppression of reverberation, at the expense of
speech quality. An iteration degree of $(1 - a) = 0.3$ yields the best PESQ score. The optimal
trade-off between PESQ and SRMR, as measured by the MVC between them, is $T_{\mathrm{win}}$ = 96 ms
and $(1 - a) = 0.7$ (lower right). Overall, STFChT-based HMMSE-LSA achieves higher objective
scores on both PESQ and SRMR.

Using the information above, we reprocessed the REVERB SimData using a window duration
of 96 ms, and degrees of iteration of $(1 - a) = 0.3$ and $(1 - a) = 0.7$. Of these two, a degree of
iteration of $(1 - a) = 0.3$ performed best (a degree of iteration of $(1 - a) = 0.7$ gave worse
objective metrics, except for SRMR). These best scores are reported in the results tables.

4 Implementation 

Our algorithms are implemented in MATLAB, and we use utterance-based processing. The 
algorithm starts by using the utterance data to estimate the $T_{60}$ time of the room using the 
blind algorithm proposed by Löllmann et al. [24]. Multichannel utterance input data is 
concatenated into a long vector, and as recommended by Löllmann et al., noise reduction is 
performed beforehand. We use Loizou's implementation [27] of Ephraim and Malah's LSA [52] 
for this pre-enhancement. 

Figure 7 shows empirically-estimated probability density functions (PDFs) of the $T_{60}$ 
estimation performance using this approach. These plots show that the precision of $T_{60}$ 
estimation [24] generally improved with increasing amounts of data (i.e., with more channels), 
although for some conditions $T_{60}$ estimates were inaccurate. Vertical dashed lines indicate 
approximate $T_{60}$ times given by REVERB organizers [5]. 

4.1 Spatial processing for multichannel data 

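For multichannel data, we estimate the direction of arrival (DOA) by cross-correlating 
oversampled data between channels. That is, we compute an $N_{\mathrm{ch}}$-length vector of time delays 
$d$ with $d_1 = 0$ and $d_i$, $i = 2, \ldots, N_{\mathrm{ch}}$, given by 

$$d_i = \frac{\operatorname*{argmax}_k \, r_{1i}[k]}{U f_s}, \tag{32}$$

where $r_{1i}[k] = \sum_n x_1[n]\, x_i[n - k]$ is the cross-correlation between the oversampled signals 
of channels 1 and $i$, $U$ is the oversampling factor, $f_s$ is the sampling rate, and $c = 340$ meters 
per second, the approximate speed of sound in air. 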
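
The delay estimate above can be sketched as follows. This is a minimal illustration, not the MATLAB implementation used in this work: it oversamples by zero-padding the spectrum, uses channel 1 (index 0) as the reference, and adopts the sign convention that a positive delay means the channel lags the reference. The function names and the test signal are hypothetical.

```python
import numpy as np

def oversample(x, U):
    """Upsample x by an integer factor U via zero-padding in the frequency domain."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    X = np.fft.rfft(x)
    if U > 1 and n % 2 == 0:
        X[-1] *= 0.5                      # split the Nyquist bin before padding
    X_up = np.zeros(n * U // 2 + 1, dtype=complex)
    X_up[: len(X)] = X
    return np.fft.irfft(X_up, n * U) * U  # rescale for the longer inverse FFT

def estimate_delays(channels, fs, U=8):
    """Delays (seconds) of each channel relative to channel 0.

    Assumed convention: a positive delay means the channel lags channel 0.
    """
    ref = oversample(channels[0], U)
    delays = np.zeros(len(channels))
    for i in range(1, len(channels)):
        xi = oversample(channels[i], U)
        r = np.correlate(xi, ref, mode="full")   # cross-correlation at all lags
        lag = np.argmax(r) - (len(ref) - 1)      # index of zero lag is len(ref) - 1
        delays[i] = lag / (U * fs)
    return delays

# Hypothetical test: channel 2 is channel 1 delayed by 3 samples at 16 kHz.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(256)
x2 = np.roll(x1, 3)
delays = estimate_delays(np.stack([x1, x2]), fs=16000, U=4)
```

Oversampling by $U$ refines the delay resolution from $1/f_s$ to $1/(U f_s)$, which matters for small arrays where inter-microphone delays are only a few samples.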

Given a time delay vector $d$, the DOA estimate is given by the solution to $Pa = c\,d$, where 
$a$ is a 3 × 1 unit vector representing the estimated DOA of the speech signal and $P$ is an 
$N_{\mathrm{ch}} \times 3$ matrix containing the Cartesian $(x, y, z)$ coordinates of the array elements. For 
example, for an eight-element uniform circular array, $P_{i1} = x_i = r\cos(i\pi/4)$, 
$P_{i2} = y_i = r\sin(i\pi/4)$, and $P_{i3} = z_i = 0$ for $i = 0, 1, \ldots, 7$, where $r$ is the array radius. 
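
A least-squares solution of this linear system can be sketched as below. The sketch assumes that the element coordinates are taken relative to the reference element (so that its delay is zero) and that the solution is normalized to unit length afterwards; the array radius and the source direction are hypothetical values chosen only for the example. For a planar array the $z$ column of $P$ is zero, so the minimum-norm solution has no $z$ component and elevation cannot be resolved.

```python
import numpy as np

def estimate_doa(P, d, c=340.0):
    """Least-squares unit-vector DOA from delays d (seconds) and positions P (meters)."""
    P_rel = P - P[0]                      # coordinates relative to the reference element
    a, *_ = np.linalg.lstsq(P_rel, c * np.asarray(d, dtype=float), rcond=None)
    return a / np.linalg.norm(a)

# Eight-element uniform circular array in the x-y plane (radius is hypothetical).
r = 0.1
i = np.arange(8)
P = np.column_stack([r * np.cos(i * np.pi / 4),
                     r * np.sin(i * np.pi / 4),
                     np.zeros(8)])

# Synthetic delays generated from a known in-plane direction.
a_true = np.array([np.cos(0.3), np.sin(0.3), 0.0])
d = (P - P[0]) @ a_true / 340.0
a_hat = estimate_doa(P, d)
```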

Figure 7: Performance of blind $T_{60}$ estimation algorithm. Sample probability density 
functions of estimated $T_{60}$ time measured on SimData evaluation dataset (these results were not 
used to tune the algorithm). For each condition, left plot is for 1-channel data, center plot is for 
2-channel data, and right plot is for 8-channel data. 

For the 8-channel case, the estimated DOA is used to form the steering vector $v(f)$ for 
a frequency-domain minimum variance distortionless response (MVDR) beamformer applied to 
the multichannel signal. The weights $w(d, f)$ for the MVDR are [28, equations (6.14-15)] 

$$w(d, f) = \frac{S_{yy}^{-1}(d, f)\, v(f)}{v^H(f)\, S_{yy}^{-1}(d, f)\, v(f)}, \tag{33}$$

where $S_{yy}(d, f)$ is the spatial covariance matrix at frequency $f$ and frame $d$ estimated using $N$ 
snapshots $Y(d - n, f)$ for $-N/2 < n < N/2$ and $v$ is given by 

$$v(f) = \exp\left( j \frac{2\pi f}{c} P a \right). \tag{34}$$


Our MVDR implementation uses a 512-sample long Hamming window with 25% overlap, a 
512-point FFT, and $N$ = 24 snapshots for spatial covariance estimates. For 2-channel data, we use 
a delay-and-sum beamformer to enhance the signal with the delay given by the DOA estimate. 
Single-channel data is enhanced directly by the single-channel HMMSE-LSA algorithm. A block 
diagram of these three cases is shown in figure 8. 
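
The MVDR weight computation for a single frequency bin can be sketched as follows. This is an illustrative sketch rather than the implementation used here: the small diagonal loading term is an added assumption for numerical stability, the sign of the steering-vector phase is assumed, and the array geometry, frequency, and random snapshots are hypothetical.

```python
import numpy as np

def steering_vector(P, a, f, c=340.0):
    """Far-field steering vector exp(j * 2*pi*f/c * P a) for positions P and DOA a."""
    return np.exp(1j * 2.0 * np.pi * f / c * (P @ a))

def mvdr_weights(S_yy, v, loading=1e-6):
    """MVDR weights S^{-1} v / (v^H S^{-1} v) for one frequency bin.

    `loading` adds a small multiple of the identity (an assumption, not from the paper).
    """
    n = S_yy.shape[0]
    S = S_yy + loading * np.trace(S_yy).real / n * np.eye(n)
    S_inv_v = np.linalg.solve(S, v)       # avoids forming the explicit inverse
    return S_inv_v / (v.conj() @ S_inv_v)

# Hypothetical 8-element circular array, one frequency bin, 24 snapshots.
idx = np.arange(8)
P = np.column_stack([0.1 * np.cos(idx * np.pi / 4),
                     0.1 * np.sin(idx * np.pi / 4),
                     np.zeros(8)])
a = np.array([1.0, 0.0, 0.0])
v = steering_vector(P, a, f=1000.0)

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 24)) + 1j * rng.standard_normal((8, 24))
S_yy = Y @ Y.conj().T / 24                # sample spatial covariance
w = mvdr_weights(S_yy, v)
output_snapshots = w.conj() @ Y           # beamformer output for each snapshot
```

By construction the weights satisfy the distortionless constraint, that is, the response toward the steering direction is exactly one while the output power is minimized.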


4.2 Time-frequency analysis-synthesis 


We tried three analysis-synthesis domains for the HMMSE-LSA enhancement algorithm: the 
STFT with a short window, the STFT with a long window, and the STFChT. The STFT 
with a short window uses 512-sample long ($T$ = 32 ms) Hamming windows, a frame hop of 
128 samples, and an FFT length of 512. Short-window STFT processing is chosen to match 
conventional speech processing window lengths. The STFT with a long window uses 2048-sample 
long ($T$ = 128 ms) Hamming windows, a frame hop of 128 samples, and an FFT length of 3262. 
Long-window STFT processing is intended to match the parameters of STFChT processing for 
a direct comparison. STFChT processing uses an analysis duration of 2048 samples, a Hamming 
analysis window, a frame hop of 128 samples, an FFT length of 3262, an oversampling factor of 8, 
and a set of possible analysis chirp rates $\mathcal{A}$ consisting of 21 equally spaced $\alpha$s from -4 to 4. 

The forward STFChT, given by (23), proceeds frame-by-frame, estimating the optimal 
analysis chirp rate $\hat{\alpha}_d$ using (27), oversampling in time, warping, applying an analysis 
window, and taking the FFT. Then HMMSE-LSA weights are estimated frame-by-frame and applied 
in the STFChT domain, and the enhanced speech signal is reconstructed using the inverse STFChT. 

Figure 8: Block diagrams of processing. For 8-channel data using a minimum variance 
distortionless response (MVDR) beamformer (top), 2-channel data using a delay-and-sum 
beamformer (DSB, middle), and 1-channel data (bottom). 
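
The warping step of the forward transform can be illustrated on a single frame. The sketch below assumes the fan-chirp warping function $\phi_\alpha(t) = (1 + \tfrac{1}{2}\alpha t)\,t$ of Cancela et al. with time measured from the frame center; it replaces the oversampling step with linear interpolation, omits any amplitude scaling that (23) may include, and does not estimate the chirp rate (the known rate of the test signal is supplied). The function names, FFT length, and test signal are hypothetical.

```python
import numpy as np

def fan_chirp_warp(frame, fs, alpha):
    """Resample a frame on a warped time axis so a linear chirp becomes stationary."""
    n = len(frame)
    t = (np.arange(n) - (n - 1) / 2.0) / fs          # time relative to frame center
    phi = lambda u: (1.0 + 0.5 * alpha * u) * u      # assumed warping function
    tau = np.linspace(phi(t[0]), phi(t[-1]), n)      # uniform grid in warped time
    if alpha == 0.0:
        t_warp = tau
    else:
        # Inverse of phi: solve tau = t + alpha * t**2 / 2 for t.
        t_warp = (-1.0 + np.sqrt(1.0 + 2.0 * alpha * tau)) / alpha
    return np.interp(t_warp, t, frame)               # linear interpolation (simplification)

def warped_spectrum(frame, fs, alpha, n_fft):
    """Hamming-windowed FFT of the warped frame."""
    warped = fan_chirp_warp(frame, fs, alpha)
    return np.fft.rfft(warped * np.hamming(len(warped)), n_fft)

# Hypothetical test signal: a 200 Hz tone whose frequency varies linearly in time.
fs, n, n_fft = 16000, 2048, 4096
t = (np.arange(n) - (n - 1) / 2.0) / fs
f0, alpha = 200.0, 3.0
chirp = np.cos(2.0 * np.pi * f0 * (1.0 + 0.5 * alpha * t) * t)

spec_plain = np.abs(warped_spectrum(chirp, fs, 0.0, n_fft))   # ordinary STFT frame
spec_warp = np.abs(warped_spectrum(chirp, fs, alpha, n_fft))  # fan-chirp frame
peak_hz = np.argmax(spec_warp) * fs / n_fft
```

With the matching chirp rate, the energy of the test signal collapses onto a single spectral peak near $f_0$, whereas the unwarped frame smears it across the range swept by the chirp. This is the coherence gain that permits longer analysis windows.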


For all methods, noise estimation is performed with a decision-directed method and simple 
online updating of the noise variance. Voice activity detection to determine if a frame is 
noise-only or speech-plus-noise is done using Loizou's method, which compares the following 
quantity to a threshold $\eta_{\mathrm{thresh}}$: 

(35) 

If $\eta(d) < \eta_{\mathrm{thresh}}$, the frame is determined to be noise-only and the noise variance is 
updated as $\lambda_v(d, k) = \mu_v \lambda_v(d - 1, k) + (1 - \mu_v) |Y(d, k)|^2$, with $\mu_v = 0.98$ and 
$\eta_{\mathrm{thresh}} = 0.15$. 
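
The recursive noise update can be sketched as below. The detection statistic used here, the mean over frequency bins of $\gamma \xi / (1 + \xi) - \log(1 + \xi)$ with a posteriori SNR $\gamma$ and a priori SNR $\xi$, is an assumption modeled on a common log-likelihood-ratio detector and is not taken from (35). The function names and the input values are hypothetical.

```python
import numpy as np

def vad_statistic(gamma, xi):
    """Assumed frame-level statistic: mean log-likelihood ratio over frequency bins."""
    return np.mean(gamma * xi / (1.0 + xi) - np.log1p(xi))

def update_noise_variance(noise_var, frame_power, xi, eta_thresh=0.15, mu_v=0.98):
    """Update the noise variance only when the frame is classified as noise-only.

    noise_var   : previous noise variance per bin, lambda_v(d - 1, k)
    frame_power : |Y(d, k)|^2 for the current frame
    xi          : a priori SNR estimate per bin (e.g. from a decision-directed rule)
    Returns the new noise variance and a flag that is True for noise-only frames.
    """
    gamma = frame_power / noise_var                  # a posteriori SNR
    eta = vad_statistic(gamma, xi)
    if eta < eta_thresh:
        return mu_v * noise_var + (1.0 - mu_v) * frame_power, True
    return noise_var, False

# Hypothetical four-bin frames: one noise-like, one speech-like.
prev = np.ones(4)
quiet_var, quiet_is_noise = update_noise_variance(prev, 2.0 * np.ones(4), 0.01 * np.ones(4))
loud_var, loud_is_noise = update_noise_variance(prev, 100.0 * np.ones(4), 50.0 * np.ones(4))
```

With $\mu_v = 0.98$ each noise-only frame contributes only 2% to the estimate, so the noise variance tracks slowly varying noise such as air conditioning without reacting to isolated frames.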

For our implementation of Habets’s joint dereverberation and noise reduction algorithm, we 
used Loizou’s implementation [27] of Ephraim and Malah’s LSA logmmse MATLAB algorithm 
as a foundation. The forward STFChT code was written by Cancela et al. [7]. We wrote our 
own MATLAB implementation of the inverse STFChT. 

For 8-channel data, the MVDR and the STFChT require the most computation. For 
1-channel and 2-channel data, the STFChT requires the most computation. For the STFChT, 
most of the computation is used to compute the GLogS for estimation of the analysis chirp rate 
$\hat{\alpha}_d$ (27) for each frame. Note that this computation could be easily parallelized in hardware. 


5 Experiments 

We compare the effectiveness of using the STFT or the STFChT as the analysis-synthesis domain 
for the HMMSE-LSA algorithm described in section 2.1.2. The tasks are the two tracks of the 
REVERB challenge: speech enhancement and automatic speech recognition. 

We evaluate our algorithms on the REVERB challenge dataset [9]. The data consists of both 
simulated and real reverberated speech. Simulated data (SimData) are created by convolving 
utterances from the Wall Street Journal Cambridge read news (WSJCAM0) corpus [33] with 
measured room impulse responses for three different reverberant rooms and at two distances: a 
near distance of about 0.5 meters and a far distance of about 2 meters. Recorded air conditioning 
noise is added at about 20 dB signal-to-noise ratio (SNR). Real data (RealData) are actual 
recordings of male and female speakers from the multichannel Wall Street Journal audio-visual 
(MC-WSJ-AV) corpus [SD] reading prompts in a noisy (air conditioning noise at about 20 dB 
SNR) and reverberant room, at two distances: a near distance of 1 meter and a far distance of 
2.5 meters. 

A summary of our results is shown in table 1 for single- and eight-channel data. For 
single-channel data, the top part of table 1 shows that STFChT processing yields superior 
enhancement results, but long-window ($T_{\mathrm{win}}$ = 128 ms) STFT processing yields superior 
recognition results. In the bottom part of table 1, results for eight-channel data indicate that 
performing multichannel STFChT processing generally yields superior enhancement as compared to 
STFT processing. For recognition, STFT processing with a long window achieves the lowest WERs. 


5.1 Speech Enhancement Results 

We score the enhanced audio using the same metrics used for the REVERB challenge, which 
include segmental frequency-weighted SNR (FWSegSNR), cepstral distance (CD), 
source-to-reverberation modulation ratio (SRMR) [31], log likelihood ratio (LLR), and perceptual 
evaluation of speech quality (PESQ) [32]. All of these metrics are intrusive (meaning that they 
require clean reference signals) except for SRMR, which is the only non-intrusive metric. Since 
RealData does not have clean reference signals, SRMR is the only metric that can be run on 
RealData. Note that the precision of the scores reported is possibly lower than the precision 
implied by the number of significant digits reported. For consistency with the work of others, 
we chose to have our table entries match the precision used by the REVERB challenge results. 




Figure 9: PESQ and SRMR results for SimData evaluation set. Upper plots are near 
distance condition, lower plots are far distance condition. "i0.3" indicates iterative enhancement 
with (1 - a) = 0.3. Legend: Orig; 1ch STFT 32ms; 1ch STFT 128ms; 1ch STFChT 128ms; 
1ch STFChT i0.3 96ms; 2ch STFT 32ms; 2ch STFT 128ms; 2ch STFChT 128ms; 
2ch STFChT i0.3 96ms; 8ch MVDR; 8ch STFT 32ms; 8ch STFT 128ms; 8ch STFChT 128ms; 
8ch STFChT i0.3 96ms. 

Figure 10: SRMR results for RealData evaluation set. Same legend as figure 9. 

reverb2014.dereverberation.com/result.se.html 

Our results on REVERB evaluation data are shown in figures 9 and 10 and tables 2 and 3. 
Tables 2 and 3 also include computation times in terms of the real-time factor (RTF), which 
we define as total processing time divided by total data time. We choose to display PESQ 
(Perceptual Evaluation of Speech Quality) [35] and SRMR (source-to-reverberation modulation 
energy ratio) [31] more prominently because the former is the ITU-T standard for voice quality 
testing [33] and the latter is both a measure of dereverberation and the only non-intrusive 
measure that can be run on RealData (for which the clean speech is not available). 

For SimData, STFChT-based enhancement always performs better in terms of PESQ than 
STFT-based enhancement using either a short (512-sample) window or a long (2048-sample) 
window, for the 8-, 2-, and 1-channel cases (except for 8-channel, far-distance data in room 3). 
Informal listening tests revealed an oversuppression of speech and some musical noise artifacts 
in STFT processing, while STFChT processing did not exhibit oversuppression or musical noise 
artifacts. The oversuppression of direct-path speech by STFT processing can be seen in the 
spectrogram comparison figure. In terms of SRMR, STFChT processing yields equivalent 
or slightly worse SRMR scores than long-window STFT processing for the 8-, 2-, and 1-channel 
cases (except for 8-channel, near-distance data, where STFChT processing does slightly better). 
Informal listening indicated that although STFT processing reduced reverberation more, it came 
at the cost of oversuppression of speech. One issue with these SRMR comparisons, however, is 
that the variance of the SRMR scores is quite high. Thus, for SimData, STFChT processing 
achieves better perceptual audio quality while still achieving almost equivalent dereverberation 
compared to STFT processing. 


5.2 Automatic Speech Recognition Results 

For ASR experiments, we use the GMM-HMM recognizer implemented in Kaldi by Weninger 
et al. [51]. The front-end of the ASR concatenates nine adjacent frames of 13 Mel-frequency 
cepstral coefficients (MFCCs) each and uses linear discriminant analysis (LDA) and semi-tied 
covariance (STC) [35] to reduce these features down to 40 dimensions. The recognizer includes 
per-utterance feature-based maximum likelihood linear regression (fMLLR) for adaptation and 
uses minimum Bayes risk (MBR) for decoding. Optional discriminative training is performed 
using boosted maximum mutual information (bMMI). Tuning the language model weight and 
beam-width further optimizes the decoding. 
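
The frame concatenation in the front-end can be sketched as follows; it only illustrates the dimension count (9 frames × 13 MFCCs = 117 values per frame before LDA and STC reduce them to 40). The sketch assumes a symmetric context of four frames on each side and replication of the first and last frames at utterance edges. The function name and the input features are hypothetical, and the LDA/STC projection itself is performed inside the recognizer and is not shown.

```python
import numpy as np

def splice_frames(feats, context=4):
    """Stack each frame with `context` neighbors on both sides.

    feats : array of shape (n_frames, n_coeffs)
    Returns an array of shape (n_frames, (2 * context + 1) * n_coeffs).
    Edge frames are replicated (an assumption made for this sketch).
    """
    n_frames = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + n_frames] for k in range(2 * context + 1)])

# Hypothetical utterance: 20 frames of 13 MFCCs.
mfcc = np.arange(20 * 13, dtype=float).reshape(20, 13)
spliced = splice_frames(mfcc, context=4)
```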

We use HMMSE-LSA in the STFT and STFChT domains to enhance reverberant and noisy 
data before feeding the enhanced audio to the recognizer. Unlike Weninger et al., we found that 
using noisy multicondition training data with enhanced audio could improve WER versus using 
noisy multicondition training data with noisy audio. However, the lowest WERs occurred when 
the recognizer was trained with pre-enhanced noisy multicondition data (pre-enhanced with the 
single-channel part of the corresponding enhancement algorithm) and run on enhanced audio. 

To show the effect of various recognizer optimizations, recognition results are shown in tables 
4 and 5. We show two decimal places to be consistent with REVERB challenge results. For 
both development and evaluation data, HMMSE-LSA with a long-window STFT ($T_{\mathrm{win}}$ = 128 
ms) performed best for both 8-channel and single-channel data. 

It is interesting that STFT-based enhancement yields better ASR performance than 
STFChT-based enhancement, especially since STFChT-based enhancement achieves better objective 
enhancement scores. We hypothesize that the better ASR performance using STFT-based 
enhancement results from the STFChT adding distortions to vocal tract dynamics. Though the 
STFChT concentrates harmonic signal energy for voiced speech, which results in better 
enhancement as discussed in section 3.1, this concentration of energy comes with the trade-off of 
distortion to the spectral envelope of the windowed frame, with distortions increasing with 
increasing chirp rates. Such distortions of the spectral envelopes result in less discriminative ASR 
features, thus increasing phone error rate, and in turn word error rate. 


www.mmk.ei.tum.de/~wen/REVERB_2014/kaldi_baseline.tar.gz 
reverb2014.dereverberation.com/result.asr.html 

6 Conclusion 


In this paper, we have demonstrated the advantages of a new transform domain for speech 
enhancement: the short-time fan-chirp transform (STFChT). By estimating linear fits in the 
instantaneous fundamental frequency of voiced speech signals, the STFChT is more coherent 
with speech signals over longer durations, which allows extension of analysis window duration. 
In turn, this increased window duration concentrates more direct-path signal into time-frequency 
bins, which enables superior enhancement results in terms of objective metrics like PESQ and 
SRMR. We also performed ASR experiments on both STFT- and STFChT-based enhancement. 
Interestingly, despite better objective enhancement scores, we observed that long-window (128 
ms) STFT processing yielded the lowest WERs. 

The utility of the STFChT warrants further investigation. Interesting future directions 
include moving beyond linear models of instantaneous frequency. Combinations of the STFChT 
and other coherence-extending transforms with deep neural network (DNN) enhancement and 
recognition methods could yield further performance improvements. 

Table 1: Summary of speech enhancement and ASR results on single- and eight-channel 
REVERB evaluation data (SimData/RealData, RealData results given when applicable). Arrows 
indicate whether a higher or lower metric is better. 


Beamforming, TF type,                           CD [dB] (↓)   SRMR (↑)    LLR (↓)       FWSegSNR [dB] (↑)   PESQ (↑)   WER [%] (↓)
window duration                                 Mean | Med.               Mean | Med.   Mean | Med.

No enh.                                         3.97 | 3.68   3.68/3.18   0.57 | 0.51   3.62 | 5.39         1.48       11.97/30.27
None, STFT, 32 ms                               3.87 | 3.48   4.79/5.80   0.68 | 0.58   6.72 | 7.62         1.53       12.32/33.37
None, STFT, 128 ms                              3.84 | 3.51   4.28/4.21   0.54 | 0.47   4.65 | 6.71         1.59       10.20/28.23
None, STFChT, 128 ms                            3.57 | 3.07   4.55/4.85   0.57 | 0.49   7.07 | 8.60         1.69       11.21/32.03
8ch MVDR, No enh.                               3.15 | 2.81   3.96/4.03   0.44 | 0.38   5.95 | 8.45         1.80       8.82/21.68
8ch MVDR, STFT, 32 ms                           3.56 | 3.23   4.77/6.90   0.61 | 0.50   8.06 | 8.47         1.83       9.84/32.19
8ch MVDR, STFT, 128 ms                          3.18 | 2.83   4.56/5.31   0.43 | 0.38   6.79 | 9.31         1.94       7.62/19.84
8ch MVDR, STFChT, 128 ms                        2.97 | 2.49   4.82/6.33   0.43 | 0.37   9.21 | 10.63        2.10       8.18/22.21
8ch MVDR, Iterated STFChT, (1 - a)=0.3, 96 ms   3.33 | 2.78   5.03/6.78   0.44 | 0.38   9.37 | 10.54        2.14       8.18/22.21

Table 2: Results for SimData evaluation set. 

SimData summary 

Ch.  Method                  Comp. time   Mean         Median       SRMR         Mean         Median       Mean         Median       PESQ
                             (RTF)        CD           CD                        LLR          LLR          FWSegSNR     FWSegSNR

     Orig                    -            3.97         3.68         3.68         0.57         0.51         3.62         5.39         1.48
8    STFT 32ms/128ms         2.59 / 2.65  3.56 / 3.18  3.23 / 2.83  4.77 / 4.56  0.61 / 0.43  0.50 / 0.38  8.06 / 6.79  8.47 / 9.31  1.83 / 1.94
8    STFChT 128ms            5.97         2.97         2.49         4.82         0.43         0.37         9.21         10.63        2.10
8    STFChT i0.3 96ms        8.56         3.06         2.57         5.03         0.44         0.38         9.37         10.54        2.14
2    STFT 32ms/128ms         0.68 / 0.70  3.80 / 3.57  3.42 / 3.22  4.86 / 4.47  0.65 / 0.49  0.55 / 0.44  7.26 / 5.46  7.93 / 7.86  1.60 / 1.66
2    STFChT 128ms            2.87         3.33         2.83         4.75         0.51         0.45         7.68         9.19         1.77
2    STFChT i0.3 96ms        5.47         3.37         2.84         5.04         0.51         0.44         8.06         9.32         1.81
1    STFT 32ms/128ms         0.35 / 0.37  3.87 / 3.84  3.48 / 3.51  4.79 / 4.28  0.68 / 0.54  0.58 / 0.47  6.72 / 4.65  7.62 / 6.71  1.53 / 1.59
1    STFChT 128ms            2.60         3.57         3.07         4.55         0.57         0.49         7.07         8.60         1.69
1    STFChT i0.3 96ms        5.19         3.59         3.06         4.83         0.57         0.49         7.57         8.89         1.72


Table 3: Results for RealData evaluation set. 

RealData summary 

Ch.  Method                  Comp. time   SRMR
                             (RTF)

     Orig                    -            3.18
8    STFT 32ms/128ms         2.54 / 2.60  6.90 / 5.31
8    STFChT 128ms            4.32         6.33
8    STFChT i0.3 96ms        6.59         6.78
2    STFT 32ms/128ms         0.70 / 0.77  6.29 / 4.57
2    STFChT 128ms            2.51         5.24
2    STFChT i0.3 96ms        4.78         5.85
1    STFT 32ms/128ms         0.50 / 0.56  5.80 / 4.21
1    STFChT 128ms            2.27         4.85
1    STFChT i0.3 96ms        4.54         5.45

Table 4: ASR results for REVERB development set using the Kaldi baseline recognizer by 
Weninger et al. [33]. Results are word error rates (WERs) in % for SimData/RealData. 
Beamforming describes the spatial processing used, time-frequency (TF) type describes the 
analysis-synthesis domain for Habets enhancement, and multicondition training (MCT) type 
indicates what kind of multicondition training data was used. All results use per-utterance 
feature-based maximum likelihood linear regression (fMLLR) for adaptation and minimum Bayes 
risk (MBR) for decoding. Optional discriminative training is performed using boosted maximum 
mutual information (bMMI). Optimized decoding refers to optimizing language model weight and 
beam-width. 


Beamforming, TF type,                   Clean trained  MCT            MCT +bMMI      MCT +bMMI
MCT type                                                                             +optimized decoding

None                                    33.21/77.78    14.88/34.35    11.99/30.50    11.31/30.72
8ch MVDR, No enh., Noisy MCT            16.11/53.64    11.01/26.57    8.21/24.12     7.91/23.91
8ch MVDR, STFT 32ms, Noisy MCT          30.33/63.95    14.52/33.63    10.10/31.80    9.84/32.19
8ch MVDR, STFT 128ms, Enhanced MCT      12.06/40.81    9.79/24.91     7.63/22.21     7.31/22.31
8ch MVDR, STFChT 128ms, Noisy MCT       13.95/51.30    11.17/30.04    10.09/29.94    9.74/29.86
8ch MVDR, STFChT 128ms, Enhanced MCT    13.95/51.30    10.02/29.34    8.34/27.76     7.96/27.98

Table 5: ASR results for REVERB evaluation set using the GMM-HMM Kaldi baseline 
recognizer by Weninger et al. [M]. Same format as table 4. 


Beamforming, TF type,                   Clean trained  MCT            MCT +bMMI      MCT +bMMI
MCT type                                                                             +optimized decoding

None                                    32.77/77.68    15.03/33.96    12.45/30.23    11.97/30.27
8ch MVDR, No enh., Noisy MCT            17.50/54.14    11.72/25.72    8.95/21.96     8.82/21.68
8ch MVDR, STFT 32ms, Noisy MCT          28.49/61.61    12.87/29.30    10.32/27.13    10.14/26.93
8ch MVDR, STFT 32ms, Enhanced MCT       12.86/41.38    10.29/22.34    7.84/19.71     7.62/19.84
8ch MVDR, STFChT 128ms, Noisy MCT       14.61/46.70    11.54/27.89    10.01/24.23    9.86/23.99
8ch MVDR, STFChT 128ms, Enhanced MCT    14.61/46.70    10.06/25.34    8.35/22.77     8.18/22.21